Abstract: In this paper, we present a complex network approach
to the study of software engineering. We have found
universal network patterns in a large collection of object-oriented 
(OO) software systems written in C++ and Java. All the systems 
analyzed here display the small-world behavior, that is, the 
average distance between any pair of classes is very small 
even when coupling is low and cohesion is high. In addition, 
the structure of OO software is a very heterogeneous network 
characterized by a degree distribution following a power-law 
with similar exponents. We have investigated the origin of these 
universal patterns. Our study suggests that some features of OO
programming languages, like encapsulation, seem to be largely
responsible for the small-world behavior. On the other hand, 
software heterogeneity is largely independent of the purpose and 
objectives of the particular system under study and appears to 
be related to a pattern of constrained growth. A number of 
software engineering topics may benefit from the present ap- 
proach, including empirical software measurement and program 
comprehension. 

I. Introduction 

It is over fifteen years since Norman Fenton outlined the 
need for a scientific basis of software measurement [1]. Such 
a theory is a prerequisite for any useful quantitative approach 
to software engineering, although it has received little
attention from either practitioners or researchers. Measurement
is the process that assigns numbers or symbols to attributes of 
real-world entities. Unfortunately, many empirical studies of 
software measurements lack a forecast system that combines 
measurements and parameters in order to make quantitative 
predictions [2]. How can we overcome these limitations?

Here we present a new approach to software engineering 
based on recent advances in complex networks [3], [4], 
[5]. We study graph abstractions of software designs, where 
nodes represent software entities (i.e., classes and/or methods) 
and edges represent static relationships between them (i.e., 
inheritance and association). We measure graph attributes of 
software designs in order to find universal patterns of software 
organization. Graph measures are not new to software [2].
Empirical software studies assume there is a correlation be- 
tween software design measures (i.e., lines of code, coupling, 
cohesion, modularity) and external features (like software 
reliability or development effort). Although good agreement 
has been observed in some cases, it is difficult to know if 
empirical mappings hold in general or not without appropriate 
models [1]. We have found that some graph measurements 
of software structures are (statistically) predictable. Moreover, 
Fig. 1. (A) A simple UML class diagram. (B) Its equivalent class graph (see
text). Every class maps to a single node in the class graph. Links in the class
graph represent member and inheritance relationships. Dynamic relationships
are not represented in the class graph (e.g., the "uses" relationship depicted
with a discontinuous link). We consider both the directed (B) and undirected
(not shown) versions of a class graph.



they are within a definite range of values. Intriguingly, these 
patterns are almost independent of functionality and other 
external features. It seems that strong constraints limit the set 
of possible patterns that software structures can display. These 
constraints might help to define reference models
that enable predictive software development processes. 

Object-oriented (OO) software systems display small-world
(SW) behavior. Many real systems, including the WWW,
food webs, and cellular networks [6], are small worlds, that
is, they have a high degree of clustering and a small average
distance. Another common property of OO software is that
probability distributions of structural attributes tend to follow
skewed distributions with long tails [7]. Heterogeneous metrics
have been interpreted as an accident or the signal of rare,
atypical behavior. In this context, software researchers often
avoid heterogeneity by manipulating the original distribution.
Unfortunately, this transformation hides important structural
information and the true nature of OO software. Here, we show
that the probability that a class participates in k relationships
follows a scaling law, that is, software designs are scale-free
(SF) networks. We have validated the SW and SF behavior of
OO software in many real systems, which suggests that they
are universal features of software designs. In this paper, we
explore the origin of these patterns. Eventually, we provide 
some tentative explanations but clearly more work is needed 
in this direction. 

The regularities found here suggest that concepts and theo- 
ries developed by complex networks studies are useful in other 
software engineering contexts, like program comprehension. 



For instance, OO software and the WWW share many structural
features. Recent analyses of web graphs have shown the
existence of some key pages called hubs and authorities [8].
Hubs are web pages having a large number of links, like web 
directories or lists of personal pages. Authorities are pages that 
contain useful information and thus are pointed to by hubs. OO 
systems display a similar pattern, where a few (hub) classes 
have a large number of relationships. Hub classes are excellent 
starting points for the program comprehension process. A 
node centrality index might enable us to locate key software 
components very quickly in a very large source code database 
(e.g., PageRank [9]). In addition, we study a particular software
system and suggest that we can obtain useful information 
by comparing different network representations of the same 
software system. 

This paper is structured as follows. Section II defines class 
graphs, an abstraction that captures static structural features 
of object-oriented systems. These class graphs display univer- 
sal features: they are small-worlds and scale-free networks. 
Section III investigates the intrinsic origin of small-world 
behavior in class graphs, which seems to be related to the
bipartite association between methods and classes. Section IV
proposes that class graphs are scale-free because they evolve
under constraints, which points to an external cause.
Finally, section V concludes the paper and outlines additional 
implications of the network patterns found here in empirical 
software engineering and distributed software development. 

II. Class Graphs 

A class graph (a software network) is a digraph $D = (V, L)$
that consists of the set $V$ of classes and the set of relationships
$L = P \cup S$. There are two types of relationships: a membership
relationship $P = \{(v_i, v_j)\}$, read "$v_i$ has part $v_j$", and a
reflexive and transitive relationship $S = \{(v_i, v_j)\}$, read
"$v_i$ is a subclass of $v_j$". However, from now on, we will
not make any distinction between these two relationships $P$
and $S$, and we only consider the full set of links $L$. We discard
any dynamic class relationship from the graph definition. For
instance, method invocation (i.e., the "uses" relationship in fig. 1A)
is not represented in the class graph (compare with fig. 1B).
Instead, we conceive nodes and links as black boxes hiding
internal complexities that do not change the global structure.
This bare-bones characterization enables us to detect global
patterns in the static software structure. Ultimately, we hope
that the analysis of class graphs will provide important insights
into high-level processes of software evolution. We also define
the undirected class graph (or undirected software network)
$G = (V, E)$, where $E = \{\{v_i, v_j\} \mid (v_i, v_j) \in L \lor (v_j, v_i) \in L\}$
is the set of edges (see fig. 1).
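The definition above can be sketched in a few lines of Python. This is only an illustration: the four-class link set and the helper name `undirected_edges` are assumptions made for the example and do not come from the dataset.

```python
# Directed class graph D = (V, L): each link (a, b) reads "a has part b"
# or "a is a subclass of b"; both relationship types are merged into L.
V = {"v1", "v2", "v3", "v4"}
L = {("v2", "v1"), ("v3", "v1"), ("v4", "v2"), ("v2", "v4")}


def undirected_edges(links):
    """Return E: one unordered pair per link present in either direction.

    Self-links are skipped, since an undirected edge needs two endpoints.
    """
    return {frozenset(pair) for pair in links if pair[0] != pair[1]}


E = undirected_edges(L)
print(len(L), "directed links ->", len(E), "undirected edges")
```

Here the two opposite links between `v2` and `v4` collapse into a single undirected edge, which is the only difference between $L$ and $E$ in this toy case.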

Class graphs represent an important information space of 
OO software systems. A prerequisite for software evolution 
and maintenance is that software engineers recognize and 
understand the function performed in software. This problem 
is aggravated in large software systems, where source code 
navigation can turn easily into a bottleneck. The efficiency 
of program comprehension depends on general and new 




Fig. 2. Directed class graph for the computer game ProRally 2002. This
scale-free and small-world network has $N = 1993$ classes. There is a clear
modular organization and nodes naturally cluster in different subsystems,
e.g., 3D rendering, physical simulation, artificial intelligence, etc. Node color
indicates the node subsystem.

knowledge [10]. General knowledge is independent of the 
particular software application. On the other hand, new knowl- 
edge includes all the specific concepts and ideas regarding 
the particular software application. This includes knowledge 
encoded in source code, which typically comprises several 
levels of abstraction. Each level of abstraction defines an 
information space or subsets of the global information space 
representing the whole software system [11]. These information
spaces display an internal structure that is navigated
by software engineers to obtain new knowledge and achieve
program comprehension. Reference [12] further decomposes information
retrieval into two different strategies: browsing and searching.
Browsing is an exploration of high-level software entities,
while searching targets low-level entities. Efficient browsing
requires an adequate software structure. For instance, modular
software (i.e., a system that has been subdivided into disjoint
chunks or modules with clear boundaries) enhances program 
comprehension and minimizes the impact of changes. In this 
context, we think that structural analyses of class graphs might 
be useful to assess the performance of browsing and program 
comprehension in general. 

A. Data and Methods 

We have collected a large sample of 80 different software 
systems written in Java and C++. This dataset represents 
a wide variety of different software applications and it is 
large enough to be statistically significant. We have recovered 
class graphs according to the definition given in the previous 
Fig. 3. Log-log plots of different size relationships in class graphs. Every
point represents a single class graph. (A) The number of links $L$ in class
graphs scales linearly with the number of classes. (B) The connectance, or
fraction of possible links $L/N(N-1)$, decays as the number of classes $N$
increases. The observed connectance is very small (about 0.1% of all possible
links).



section. Five of these systems provide full UML class diagrams:
ProRally 2002 (a proprietary C++ videogame), Striker
(C++ videogame), JDK-A and JDK-B (the two largest connected
components of Java 1.5) and Mudsi (a distributed Java
application). These UML class diagrams are design documents
released by their respective software developers. The mapping
from UML class diagrams to class graphs is straightforward
(see fig. 1). The remaining software systems represent a diverse
repertoire of Open Source (OS) applications written in C++.
These class graphs were reverse-engineered with a simple
lexical analysis of the C++/Java source code (see Appendix
I). Fig. 7e is the class graph for the C++ code in fig. 7a. In
Table I there is a summary of graph measurements in a subset
of software systems. Figure 2 shows the class graph for the
C++ videogame ProRally 2002 (see below).

B. Connectance and Linear Growth 

The number of links $L = |L|$ scales with the number of classes
$N = |V|$ in an almost linear way (see fig. 3A):



(1) 



This shows that class graphs are very sparse. In addition,
this linear dependence between links and nodes means that
every new class attaches (on average) to an approximately
constant number of existing classes. This fits very well the
assumption of linear growth in software systems [13].
Define the richness connectance of a graph as the fraction
of used links $L$ compared to the number $N(N-1)$ of
links in the complete graph (self-referencing is avoided). If $L$
scales linearly with $N$, then richness connectance will decay
approximately as $1/N$ (see fig. 3B). Linear growth does not
allow for extensive changes to the large-scale class graph
structure. Connectance decays very fast and network size
quickly saturates to a constant value. This saturation has been
associated with a pattern of increasing complexity in software
development [14].
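As a quick numerical check of the sparseness claim, the following sketch evaluates the connectance $L/N(N-1)$ for three systems, using the $N$ and $L$ values listed in Table I. The function name `connectance` is an illustrative choice, not part of the original study.

```python
def connectance(n_classes, n_links):
    """Fraction of the N(N-1) possible directed links that are actually used."""
    return n_links / (n_classes * (n_classes - 1))


# (N, L) pairs copied from Table I
systems = {"Mudsi": (168, 241), "ProRally": (1993, 4987), "Striker": (2356, 6748)}

for name, (n, links) in systems.items():
    print(f"{name}: L/N = {links / n:.2f}, connectance = {connectance(n, links):.5f}")
```

The ratio $L/N$ stays within a narrow band while the connectance drops as $N$ grows, which is the $1/N$ decay expected when $L$ scales linearly with $N$.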



C. Class Graphs are Small-Worlds 

Watts and Strogatz found that many real networks display 
short average path length and high clustering (or nonnegligible 
cliquishness) [6]. A network displaying these properties is 
called a small world (SW). Given a node $v_i$ with degree $k_i$
(i.e., the number of links attached to the node), we define node
clustering $C_i$ as the ratio of the actual number of triangles $t_i$
in which node $v_i$ participates to the maximum possible number:

$$C_i = \frac{2 t_i}{k_i (k_i - 1)} \qquad (2)$$



The clustering coefficient $C$ of a graph measures the proportion
of triangles in the graph:

where $A$ is the adjacency matrix of the graph, with $A_{ij} = 1$
if nodes $v_i$ and $v_j$ are connected and $A_{ij} = 0$ otherwise.
For random graphs, the clustering coefficient is inversely 
proportional to the graph size: 



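$$C_{rand} \approx \frac{\langle k \rangle}{N} \qquad (4)$$

where $\langle k \rangle$ is the average degree.
The clustering coefficient of a SW is significantly larger than
the expected clustering coefficient for the random network,
$C \gg C_{rand}$. Nodes in a SW are densely connected with
their immediate neighborhood.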
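A minimal sketch of the node clustering computation follows. It assumes an undirected graph stored as a dictionary of neighbor sets, and it assumes that the graph-level $C$ is the plain average of $C_i$ over all nodes (one common convention); the toy graph and function names are illustrative only.

```python
from itertools import combinations


def node_clustering(adj, v):
    """C_i = 2 t_i / (k_i (k_i - 1)); taken as 0 when k_i < 2."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    # t_i: number of pairs of neighbours of v that are themselves linked
    t = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * t / (k * (k - 1))


def graph_clustering(adj):
    """Average of C_i over all nodes (assumed convention for C)."""
    return sum(node_clustering(adj, v) for v in adj) / len(adj)


# Toy undirected graph: a triangle (a, b, c) plus a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(node_clustering(adj, "c"), graph_clustering(adj))
```

Node `c` has three neighbors but only one linked pair among them, so its clustering is 1/3, while the two other triangle nodes reach the maximum value of 1.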

Average path length $d$ is a measure of global connectivity,
that is, the mean distance $d_{ij}$ required to navigate between
any pair of nodes $v_i$ and $v_j$:

$$d = \frac{2}{N(N-1)} \sum_{i < j} d_{ij} \qquad (5)$$



The average path length in random graphs is proportional
to the logarithm of their size:

$$d_{rand} \approx \frac{\log N}{\log \langle k \rangle} \qquad (6)$$

The average path length of a SW is as small as in the
unrestricted random case, $d \approx d_{rand}$, due to a few long-range
edges (shortcuts) connecting distant regions of the network.
Thus, a small average path length is compatible with a broad
range of clustering coefficient values [6]. This is a measure
of network spread or compactness that has been observed in
different contexts, from the Internet to social networks. In
these systems, it is useful to keep d as low as possible. For 
example, shortest paths often enable faster communications. 
On the other hand, coupled oscillator systems with short 
average path lengths synchronize much faster than systems 
displaying longer paths [15], [16]. 
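For an unweighted graph, the average path length can be obtained with one breadth-first search per node. The sketch below assumes an undirected, connected graph stored as a dictionary of neighbor sets; pairs that cannot reach each other are simply skipped, which is a simplifying assumption rather than the procedure used in the study.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distance (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Mean distance over all ordered pairs of distinct, mutually reachable nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs


# Toy example: the path graph a - b - c - d
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(average_path_length(adj))
```

In the four-node path the six pairwise distances sum to 10, so the average is 10/6; adding a single shortcut between `a` and `d` would lower it, which is the effect attributed to long-range edges above.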

We have measured $d$ and $C$ in all the class graphs described
above. Comparison with random predictions shows that class
graphs are instances of small worlds (see table I). For every
class graph, we have observed that $C \gg C_{rand}$ and $d \approx d_{rand}$.
In fig. 4 we can clearly appreciate this result: the
clustering coefficient in class graphs is well above the random
expectation while the average path length is rather small.
Actually, the values of $C$ seem rather independent of system
size $N$. This is a common feature in hierarchical networks
[17].

TABLE I
Graph Measurements

Dataset         N      L      d     d_rand   C       C_rand
Mudsi           168    241    2.88  4.95     0.244   0.017
JDK-B           1364   1947   5.97  6.80     0.225   0.002
JDK-A           1376   2162   5.40  6.28     0.159   0.002
Prorally        1993   4987   4.85  4.71     0.211   0.003
Striker         2356   6748   5.90  4.46     0.282   0.002
gchempaint      27     41     2.85  3.26     0.204   0.102
4yp             54     90     3.28  3.44     0.069   0.059
Prospectus      99     168    3.80  3.77     0.14    0.034
eMule           129    218    3.87  4.16     0.237   0.025
Aime            143    319    2.66  3.34     0.413   0.031
Openvrml        159    335    3.53  3.53     0.08    0.026
gpdf            162    300    4.02  3.93     0.303   0.022
Dm              162    254    4.32  4.45     0.304   0.019
Bochs           (values illegible in the source)
Quanta          166    239    4.31  5.03     0.198   0.017
Fresco          189    277    4.73  4.89     0.228   0.015
Freetype        224    363    4.29  4.71     0.193   0.014
Yahoopops       373    711    5.57  4.47     0.336   0.01
Blender         495    834    6.54  5.14     0.155   0.007
GTK             748    1147   5.87  5.91     0.081   0.004
OIV             1214   3903   3.99  3.82     0.122   0.005
wxWindows       1309   3144   4.03  4.62     0.235   0.004
CS              1488   3526   3.92  4.74     0.135   0.003

Fig. 4. Log-log plots validating the small-world model of class graphs.
Every point corresponds to a single class graph. (A) Average path length vs
class graph size. Normalized path length grows with the logarithm of the number
of classes, as expected in small-world networks (see text). (B) Normalized
clustering for the systems analysed here strongly departs from the predicted
scaling relation followed by random graphs (dashed line). Class graphs are
much more clustered (by orders of magnitude) than their random counterparts.
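The random baselines in Table I can be approximated from $N$ and $L$ alone. The sketch below assumes the standard random-graph estimates $C_{rand} \approx \langle k \rangle / N$ and $d_{rand} \approx \log N / \log \langle k \rangle$ with $\langle k \rangle = 2L/N$ for the undirected graph; this choice of $\langle k \rangle$ is an assumption that happens to match the tabulated values closely, not a statement of how the table was produced.

```python
from math import log


def random_baselines(n_classes, n_links):
    """Return (C_rand, d_rand) for a random graph with the same N and L.

    Assumes <k> = 2L/N (undirected average degree).
    """
    mean_degree = 2.0 * n_links / n_classes
    c_rand = mean_degree / n_classes
    d_rand = log(n_classes) / log(mean_degree)
    return c_rand, d_rand


# (N, L, d, C) rows copied from Table I
rows = {"Mudsi": (168, 241, 2.88, 0.244), "ProRally": (1993, 4987, 4.85, 0.211)}

for name, (n, links, d, c) in rows.items():
    c_rand, d_rand = random_baselines(n, links)
    print(f"{name}: C/C_rand = {c / c_rand:.0f}, d/d_rand = {d / d_rand:.2f}")
```

For both systems the clustering ratio is well above 10 while the path-length ratio stays near 1, which is the small-world signature described in the text.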

D. Small-World and Breakdown of Modularity 

Software systems are constantly evolving. A goal of soft- 
ware engineers is to minimize the cost of software evolution 
by limiting the consequences of changes. Some designs are 
better than others in this regard because they allow software 
engineers to make small changes without propagating many 
secondary changes to other software components. In this con- 
text, it would be desirable to have reliable estimates of future 



change costs. Unfortunately, we still do not understand very 
well the properties of software development and maintenance. 
Here, we propose that software maintenance is a global process 
and thus, it is very difficult to predict the spreading of software 
changes. 

For example, compare the modular graph in fig. 5B and the
random graph in fig. 5A. The graph in fig. 5B displays
three highly clustered modules (i.e., a module is a subset of
nodes that exchange many more links among them than with
the rest of the network). Here, modules are interconnected
to other modules by a single link, which suggests that
internal changes in a module cannot affect other modules.
On the other hand, fig. 5A is an example of a highly coupled,
loosely modular architecture. This is a random graph and all
nodes belong to the same module. Changes in the random
graph (fig. 5A) are more likely to affect many more nodes
than changes in the modular graph (fig. 5B). Notice that local
measures cannot separate random and modular structures (i.e.,
the random graph and the modular graph have the same
average degree $\langle k \rangle$). This suggests why empirical studies
do not report significant correlations between local software
measures and change impact [18]. The state of a class relies on
the state of all the other classes it references, including those
classes referenced through a chain of intermediate classes.
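The dependence through chains of references can be made concrete with a reachability computation on the directed class graph. The sketch below is an illustration under stated assumptions: a link `(a, b)` is read as "class a references class b", and the five-class link set and the name `impact_set` are hypothetical.

```python
def impact_set(links, changed):
    """Classes that reference `changed` directly or through a chain of links."""
    referrers = {}
    for src, dst in links:  # src references dst
        referrers.setdefault(dst, set()).add(src)
    affected, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for src in referrers.get(node, ()):
            if src not in affected and src != changed:
                affected.add(src)
                stack.append(src)
    return affected


# Hypothetical links: v2 and v3 reference v1, v4 references v2, v5 references v4.
links = {("v2", "v1"), ("v3", "v1"), ("v4", "v2"), ("v5", "v4")}
print(sorted(impact_set(links, "v1")))
```

Class `v1` has only two direct referrers, yet a change to it can reach four classes. A purely local measure such as in-degree misses the two indirect ones, which is the point made in the paragraph above.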

There is ample evidence that many software projects have 
a natural tendency to become disordered structures [19]. 
This code degradation is often associated with a breakdown 
of modularity that happens when changes are widely dis- 
persed and affect many unrelated classes in apparently distant 
modules [20]. We suggest that such breakdown of modularity 
might be related to the emergence of the small-world behavior. 
Recall that a highly-clustered class graph (i.e., a modular 
graph) becomes a small-world by the addition of a few shortcut 
links between dissimilar nodes (i.e., a relationship between
unrelated classes in different software modules). Once the
system displays small-world behavior, its average path length
gets near the minimal value $d_{rand}$ and the software project
might be closer to a breakdown of modularity. In this context,
we propose that software engineers evaluate the risk of code
degradation by measuring any significant deviation of average
path length (eq. 5). This global measure could be a better indicator
of code degradation because it takes into account indirect 
effects. 

E. Class Graphs are Scale-Free Networks 

Class graphs are highly heterogeneous networks, where a
very few classes participate in many relationships and the
majority of classes have one or two relationships [21]. Highly
connected classes are key software components that keep
the whole software system as a coherent entity. In this context,
software designs are remarkably similar to many other
complex networks, like the WWW, the Internet and many
biological networks [3]. They are all examples of scale-free
(SF) networks, that is, they have a degree distribution that
follows a scaling law, $P(k) \sim k^{-\gamma}$. As shown in figure
6 and in table II, class graphs are nice instances of scale-free
(SF) networks. The fact that all the graphs analysed here
display SF structure, in spite of the obvious differences in size,
functionality and other features, is an indication that strong
constraints are at work in software evolution. However, and
contrary to the small-world feature of class graphs, we suggest
this scale-free behavior has an exogenous origin (see below).

Fig. 5. Two different graphs with the same number of nodes $N = 11$
and links $L = 18$. (A) Random graph with $\langle k \rangle = L/N \approx 1.63$. This
graph has small average path length $d = d_{rand} = 1.92$ and low clustering
$C = C_{rand} = 0.06$. (B) Modular graph with the same average degree. This
graph is a small world because its average path length is comparable to the
random prediction, $d = 2.0 \approx d_{rand}$, and its clustering coefficient is about
twice the random value, $C = 0.14 \approx 2 C_{rand}$.
Panel labels: (A) C = 0.06, d = 1.92; (B) C = 0.14, d = 2.03.

The cumulative degree distribution $P_>(k)$ reduces noise
levels during the estimation of the scaling exponent $\gamma$:

$$P_>(k) = \sum_{k' > k} P(k') \qquad (7)$$

If $P(k) \sim k^{-\gamma}$ then we have $P_>(k) \sim \int_k^\infty P(k')\,dk' \sim k^{-\gamma+1}$.
The exponent $\gamma$ is estimated by linear regression in
the log-log plot (see figure 3b and figure 9). For the class graphs
analyzed here, we obtain $\gamma \approx 2.5$. On the other hand, the in-degree
and out-degree distributions of directed class graphs also
follow power laws, $P_{in}(k) \sim k^{-\gamma_{in}}$ and $P_{out}(k) \sim k^{-\gamma_{out}}$.
Directed degree distributions display different exponents from
the undirected version. Typically, we observe $\gamma_{in} < \gamma$ and
$\gamma_{out} > \gamma$. In other words, if we look at the number of
outgoing and incoming links, the resulting degree distributions
are different. The in-degree distribution has a clear power-law
tail while the out-degree distribution decays much faster. A
similar pattern has been observed in the web graph [22]. An
extensive study of the entire WWW in October 1999 used
the webcrawl from Altavista to obtain empirical in-degree and
out-degree distributions for a subset of the full web graph [23].
That study showed that the in- and out-degree distributions of the
web graph are fitted by scaling laws with exponents $\gamma_{in} = 2.1$
and $\gamma_{out} = 2.7$. These exponents are very close to the average
in-degree exponent $\langle \gamma_{in} \rangle = 2.2$ and the average out-degree
exponent $\langle \gamma_{out} \rangle = 2.8$ taken over all class graphs in table II.
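The estimation procedure can be sketched as follows. This is a minimal illustration, not the fitting code used for Table II: it assumes all degrees are at least 1, it uses the "degree at least k" convention for the cumulative distribution so that the last point is nonzero on a log scale, and the degree sequence is synthetic, constructed so that the cumulative distribution decays exactly as $k^{-1.5}$.

```python
from collections import Counter
from math import log


def cumulative_distribution(degrees):
    """Sorted (k, P) pairs, with P the fraction of nodes whose degree is >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    points, remaining = [], n
    for k in sorted(counts):
        points.append((k, remaining / n))
        remaining -= counts[k]
    return points


def fit_exponent(points):
    """Least-squares slope of log P versus log k; returns gamma = 1 - slope."""
    xs = [log(k) for k, _ in points]
    ys = [log(p) for _, p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return 1.0 - sxy / sxx


# Synthetic degree sequence (512 nodes) with cumulative slope -1.5
degrees = [1] * 448 + [4] * 56 + [16] * 7 + [64]
print(fit_exponent(cumulative_distribution(degrees)))
```

A cumulative slope of -1.5 corresponds to $\gamma = 2.5$, the value quoted above. On real data the fit is usually restricted to the tail, a choice this sketch leaves out.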

F. Related Work 

In order to measure software cohesion and coupling, [24] 
proposed to represent software designs with graphs, where 
nodes represent software entities and edges are relationships 
between entities. In this framework, a software module is a 




Fig. 6. (Top) Cumulative degree distributions for several class graphs: eMule
(N=129, triangles), Blender (N=495, squares) and CS (N=1488, circles). All
distributions have an exponent about -2.5 in spite of the obvious differences
in size and functionality. (Bottom) Asymmetry of in-degree (open circles)
and out-degree (filled circles) distributions for ProRally 2002. The in-degree
distribution is the probability that a given component is reused by $k_{in}$ other
components. Conversely, the out-degree distribution is the probability that a
component uses $k_{out}$ other components. Notice how the out-degree distribution
shows a sharp cutoff.



subset of nodes (or subgraph) more densely connected internally than
with the rest of the network. Reference [24] explores the desirable
properties of any cohesion and coupling measurement.
At the coarsest level of description of a software system (or
software architecture), a measure of coupling is the number of
edges exchanged between modules. The cohesion of a module
comprising $k$ elements scales with the ratio $2t/k(k-1)$, where
$t$ is the number of edges within the module. These coupling
and cohesion definitions correlate with the number of edges
$L$ and local clustering $C_i$, respectively. In this context,
a statistically valid model of class graph will provide useful 
estimators of coupling and cohesion in real systems. 
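The two graph-based definitions above translate directly into code. The following sketch assumes undirected edges listed once each; the five-node example, the module split, and the function names are illustrative assumptions.

```python
def cohesion(module, edges):
    """2t / (k (k - 1)): fraction of possible internal edges that exist."""
    k = len(module)
    if k < 2:
        return 0.0
    t = sum(1 for a, b in edges if a in module and b in module)
    return 2.0 * t / (k * (k - 1))


def coupling(module_a, module_b, edges):
    """Number of edges with one endpoint in each module."""
    return sum(
        1
        for a, b in edges
        if (a in module_a and b in module_b) or (a in module_b and b in module_a)
    )


# Hypothetical system: a triangle module and a two-node module joined by one edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
m1, m2 = {"a", "b", "c"}, {"d", "e"}
print(cohesion(m1, edges), cohesion(m2, edges), coupling(m1, m2, edges))
```

Both modules are fully connected internally (cohesion 1.0) and share a single edge, the low-coupling, high-cohesion situation that modular designs aim for.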

Measurements are needed to assess the best design solution 
among different alternatives. From all the candidates in the 
solution space, we want to pick the one with the highest 
metric value [25]. It has been proposed that change impact 
defines one of the axes of this solution space. Some graph 
measurements from the logical structure of OO systems can 
be used as (static) estimators of change impact. A related 
definition is alteration visibility or the size of the set of classes 
affected by the change to a single class [26]. A more detailed 
approach to impact analysis uses approximate algorithms to 
compute ripple effects from low-level source code features 
(see [27] [28]). A distance measure very similar to path length 
was used in [26] in order to select the best choice when 
restructuring large software designs. However, [26] proposed 




    class v1 {
        int x;
    };

    class v2 : public v1 {
        void m2();
    };

    class v3 : public v1 {
        void m3(v1 a, v2 b);
    };

    class v4 {
        void m4(v2 a);
    };

Fig. 7. (a) An example of C++ code and its: (b) bipartite association graph $B$, (c) its class projection $B_v$, (d) its method projection $B_u$ and (e) its class
graph $G$.



TABLE II
Exponents of Cumulative Degree Distributions

Dataset       Degree         In-degree      Out-degree
Mudsi         1.74 ± 0.04    1.20 ± 0.08    2.00 ± 0.05
JDK-B         1.55 ± 0.08    1.39 ± 0.05    2.30 ± 0.14
JDK-A         1.41 ± 0.02    1.18 ± 0.02    2.39 ± 0.14
Prorally      1.72 ± 0.03    1.44 ± 0.02    1.88 ± 0.10
Striker       1.70 ± 0.04    1.54 ± 0.03    1.73 ± 0.06
gchempaint    1.63 ± 0.31    1.11 ± 0.35    1.41 ± 0.12
4yp           1.54 ± 0.09    1.30 ± 0.05    1.59 ± 0.18
Prospectus    1.67 ± 0.09    1.13 ± 0.09    1.92 ± 0.27
eMule         1.58 ± 0.03    1.51 ± 0.07    1.42 ± 0.08
Aime          1.43 ± 0.05    1.30 ± 0.04    1.48 ± 0.07
Openvrml      1.34 ± 0.06    0.94 ± 0.05    1.59 ± 0.23
gpdf          1.64 ± 0.11    1.23 ± 0.10    1.76 ± 0.17
Bochs         1.37 ± 0.08    1.17 ± 0.09    1.64 ± 0.20
Quanta        1.69 ± 0.10    1.55 ± 0.13    1.87 ± 0.13
Fresco        1.66 ± 0.09    1.14 ± 0.10    1.76 ± 0.19
Freetype      1.65 ± 0.07    1.42 ± 0.04    1.82 ± 0.16
Yahoopops     1.67 ± 0.05    1.46 ± 0.06    1.69 ± 0.05
Blender       1.64 ± 0.04    1.36 ± 0.05    2.04 ± 0.09
GTK           1.51 ± 0.04    1.22 ± 0.02    2.38 ± 0.20
OIV           1.43 ± 0.02    1.14 ± 0.03    2.10 ± 0.12
wxWindows     1.41 ± 0.03    1.11 ± 0.02    2.18 ± 0.12
CS            1.58 ± 0.02    1.22 ± 0.03    1.96 ± 0.09



to measure path length between classes belonging to a single
module (i.e., intra-distance) while here we propose to measure
path length between all the classes in the class graph (i.e.,
inter-distance).

Chidamber and Kemerer (C&K) presented a suite of object 
oriented metrics [29] that seems to be related to our suite of 
graph measurements in class graphs. Several histograms of
C&K metrics display highly skewed distributions, e.g., fig. 2
for the WMC (Weighted Methods per Class) metric, fig. 14 
for the CBO (Coupling between Object Classes) metric and 
fig. 16 for RFC (Response for a Class) metric in [29]. These 
histograms resemble power-laws like the degree distribution of 
class graphs (see previous subsection). Unfortunately, while 
C&K metrics appear to be related to some of our graph 



measurements, no further comparison is possible because 
no regression analysis of histograms is available from their 
study. Still, C&K made a qualitative interpretation of extreme 
metric values in terms of "outliers" [29], which provides some 
evidence of heterogeneous software metrics. 

In this context, there is a close relationship between depth of 
inheritance tree (or DIT, see [29]) and degree of difficulty in 
understanding and comprehending the organization of object- 
oriented systems. Dvorak claimed that "the deeper the level 
of the hierarchy, the greater the probability that a subclass 
will not consistently extend and/or specialize the concept of its 
superclass" [30], that is, excessively deep class hierarchies are 
complex to develop. Evidence of a positive correlation between
DIT and the likelihood of faults in OO systems is given in
[31]. We should expect some correlation between DIT and the
average path length $d$ of class graphs because the inheritance tree
is a subset of the class graph, which suggests that DIT might
be closely related to the small-world property. Very low values of
DIT found in the study of Li and Henry [32] provide some 
empirical support to this hypothesis. 

III. Class-Method Association Graphs 

We have shown that class graphs are small worlds. Here, we
investigate the possibility that small-world behavior can spontaneously
emerge in an evolving OO software system. We provide empirical
and theoretical support to this conjecture by modeling
the hierarchical structure of OO software with a bipartite
association between classes and methods. In Java and C++,
we conceive a software system as a set of interrelated classes.
These classes are further decomposed into data members (or
variables) and code methods (a method is the OO equivalent
of subroutines in Fortran, C or Pascal). This hierarchical
organization of OO software can be represented with a bipartite
association graph $B = (V, U, E)$ where $V = \{v_i\}$ is the
set of classes, $U = \{m_j\}$ is the set of methods and
$E = \{\{v_i, m_j\}\}$ is the set of dependencies between classes
and methods. Also, $N = |V|$ is the number of classes and
$M = |U|$ is the number of methods. We have an edge
$\{v_i, m_j\} \in E$ when class $v_i$ appears in the parameter list
of method $m_j$. In addition, a class is always a parameter of
its own collection of methods (i.e., the self or this keyword). We
can recover this bipartite graph with a simple algorithm (see
Appendix II). Figure 7 illustrates a small C++ code (see fig.
7a) and its corresponding bipartite association graph (see fig.
7b).

We define the (discrete) generating functions $\mu(n)$ and $\nu(n)$
for the bipartite graph:

$$\mu(n) = \sum_k k^n P_u(k) \qquad (8)$$

and

$$\nu(n) = \sum_k k^n P_v(k) \qquad (9)$$

where $n = 1, 2, \ldots$, $P_u(k)$ is the fraction of $U$ nodes
having $k$ edges and $P_v(k)$ is the fraction of $V$ nodes having
$k$ edges. The first moments $\mu = \mu(1)$ and $\nu = \nu(1)$ indicate
the average method degree and the average class degree,
respectively. It is easy to check that $M\mu = N\nu$, since both sides count the edges of $B$.
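As an illustration, the bipartite association graph of the small C++ example in Fig. 7 can be written down by hand and its first moments checked. The dictionary below is a manual transcription made for this sketch (each method is tied to the classes in its parameter list plus its owning class), not the output of the Appendix II algorithm; the helper `moment` is a hypothetical name.

```python
# Classes tied to each method of the Fig. 7 example: parameter classes plus
# the owning class (the implicit `this` argument). Transcribed by hand.
method_classes = {
    "m2": {"v2"},
    "m3": {"v3", "v1", "v2"},
    "m4": {"v4", "v2"},
}
classes = {"v1", "v2", "v3", "v4"}

# Bipartite edge set E = {{v_i, m_j}}, stored as (class, method) pairs
edges = {(v, m) for m, vs in method_classes.items() for v in vs}


def moment(degrees, n):
    """n-th moment of a degree sequence, i.e. the sum over k of k^n P(k)."""
    return sum(k ** n for k in degrees) / len(degrees)


method_degrees = [len(vs) for vs in method_classes.values()]
class_degrees = [sum(1 for v, _ in edges if v == c) for c in sorted(classes)]

mu = moment(method_degrees, 1)  # average method degree
nu = moment(class_degrees, 1)   # average class degree
print(len(edges), mu, nu)
```

With $M = 3$ methods and $N = 4$ classes, the six edges give an average method degree of 2 and an average class degree of 1.5, so $M\mu$ and $N\nu$ both equal the number of edges.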

The one-mode projection (or unipartite) network expresses 
connections between nodes of the same kind (see fig. 7C,D). 
We have two one-mode projections, $B_v = (V, E_v)$ (the 
so-called class projection) and $B_u = (U, E_u)$ (the so-called 
method projection), from the bipartite association method-class 
graph. Formally, we define $A$ as the adjacency matrix of the 
bipartite network $B$, where $A_{ij} = 1$ if 
$\{v_i, m_j\} \in E$ and $A_{ij} = 0$ otherwise. The adjacency 
matrix $A^v$ for the one-mode projection $B_v$ is related to the 
adjacency matrix $A$ by: 

$$A^v_{ij} = \sum_k A_{ik} A_{jk} \tag{10}$$

A similar relation holds between the adjacency matrix $A^u$ 
of projection $B_u$ and the adjacency matrix $A$ of the bipartite 
network $B$: 

$$A^u_{ij} = \sum_k A_{ki} A_{kj} \tag{11}$$
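
A short sketch of the two projections follows, assuming the 
bipartite adjacency matrix is stored with one row per class and 
one column per method. The function names are illustrative, and 
the diagonal is set to zero here (an assumption, so that the 
projections carry no self-loops). 

```python
def class_projection(A):
    """Entry (i, j) counts the methods shared by classes i and j."""
    n_classes, n_methods = len(A), len(A[0])
    return [[0 if i == j else
             sum(A[i][k] * A[j][k] for k in range(n_methods))
             for j in range(n_classes)]
            for i in range(n_classes)]


def method_projection(A):
    """Entry (i, j) counts the classes shared by methods i and j."""
    n_classes, n_methods = len(A), len(A[0])
    return [[0 if i == j else
             sum(A[k][i] * A[k][j] for k in range(n_classes))
             for j in range(n_methods)]
            for i in range(n_methods)]


# Hypothetical 3 classes x 4 methods association matrix.
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]

Av = class_projection(A)
Au = method_projection(A)
print(Av)
print(Au)
```

Entries larger than one indicate pairs that share several 
neighbours in B; thresholding them to 0/1 gives an unweighted 
projection. 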



Newman et al. have shown that one-mode projections 
must be small worlds even when the bipartite association is 
random [33]. Social networks display high clustering 
coefficients because agents follow a natural tendency to group 
together in communities. Moreover, the addition of a few 
shortcuts between distant agents in clustered communities 
yields small average path lengths. Assuming that the bipartite 
association $B$ is random, the average path length between two 
classes in $B_v$ will be very small: 

$$d(B_v) = \frac{\log N}{\log z} \tag{12}$$

and correspondingly for the method projection $B_u$ we have 

$$d(B_u) = \frac{\log M}{\log z} \tag{13}$$
Fig. 8. Comparison between local and global measurements in class graphs 
and class projections. Every point represents the pair of graphs $(G, B_v)$ 
from a single software system. (A) The clustering coefficient in class graphs 
scales linearly (fitting slope = 0.95 $\approx$ 1) with clustering coefficient in class 
projections. (B) Global efficiency in class graphs scales linearly (fitting slope 
= 1.15 $\approx$ 1) with global efficiency in class projections. Good agreement 
suggests that statistical features of class graphs, like the small-world property, 
may be inherited from class projections. 



where $z = \mu\nu$ is the expected average degree for the 
one-mode projection. The clustering coefficient for the one-mode 
projection $B_v$ will be very high: 



$$C(B_v) = \frac{1}{[\ldots]} \tag{14}$$

and for $B_u$, 

$$C(B_u) = \frac{1}{[\ldots]} \tag{15}$$



The above suggests that partitioning an OO software 
system into classes and methods is very likely to result in 
a highly clustered software structure with small average path 
lengths. Moreover, this seems to be largely independent of 
the specific association between methods and classes. The 
clustering coefficient and mean path length only depend on the 
average connectivity of classes and methods. The small-world 
behavior of class graphs does not appear to be an additional 
requirement selected by software engineers but an unavoidable 
feature associated with the hierarchical nature of OO software 
systems. 
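
One way to see why class projections tend to be highly clustered 
is that every method touching k classes contributes a clique of 
k classes to the projection. The sketch below illustrates this on 
a hypothetical method-to-classes table; it uses the usual average 
of local clustering coefficients, which is assumed here to match 
the definition used elsewhere in the paper. 

```python
from itertools import combinations


def project_classes(method_to_classes):
    """Link two classes whenever they share at least one method."""
    adj = {}
    for classes in method_to_classes.values():
        for c in classes:
            adj.setdefault(c, set())
        for a, b in combinations(sorted(set(classes)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_clustering(adj):
    """Mean over nodes of (links among neighbours) / (possible links)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours count as 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


# Hypothetical association: two methods with three classes each,
# joined by one method with two classes.
method_to_classes = {
    "m1": ["A", "B", "C"],
    "m2": ["C", "D"],
    "m3": ["D", "E", "F"],
}
adj = project_classes(method_to_classes)
clustering = average_clustering(adj)
print(round(clustering, 3))
```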

A. Comparison between Class Graphs and Class Projections 

Beyond the specific topological patterns displayed by any OS 
software system, we can investigate how well (in a statistical 
sense) the class projection explains the structural patterns 
displayed by the class graph. Here, we are more interested in the 
structural properties of the average OS software system. In this 
context, we have found very good agreement between local 
and global measures of class graphs and class projections. In 
order to enable a meaningful comparison between the class 
projection $B_v$ and the class graph $G$, we must ensure they 
have the same number of nodes and links. First, we obtain 
the filtered class graph $G = (V, E)$ by removing the nodes of 
$G$ without edges in $B_v = (V, E_v)$. On the other hand, the 
class projection $B_v$ often displays more edges than the 
filtered class graph $G$, that is, $|E_v| > |E|$. Then, we 
remove a fraction $p = 1 - |E|/|E_v|$ of the edges from the 
original class projection to obtain the filtered class 
projection $B_v = (V, E_v)$ having the same number of nodes and 
edges as the filtered class graph $G$. For the systems analyzed 
here, the average edge removal probability is 
$\langle p \rangle \approx 0.54$. 
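
The edge-removal step can be sketched as follows. The text does 
not say how the removed edges are chosen, so this sketch assumes 
they are drawn uniformly at random and keeps exactly as many 
edges as the filtered class graph has; the function name and the 
toy edge set are hypothetical. 

```python
import random


def filter_projection(projection_edges, n_target, seed=0):
    """Keep n_target edges of the projection, chosen uniformly at random.

    Returns the kept edges and the removal fraction p = 1 - |E| / |E_v|.
    """
    edges = sorted(projection_edges)
    if n_target > len(edges):
        raise ValueError("projection has fewer edges than the class graph")
    p = 1.0 - n_target / len(edges)
    kept = random.Random(seed).sample(edges, n_target)
    return set(kept), p


# Toy projection with 10 edges, reduced to the 4 edges of a class graph.
projection_edges = {(i, j) for i in range(5) for j in range(i + 1, 5)}
kept, p = filter_projection(projection_edges, n_target=4)
print(len(kept), p)
```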

The clustering coefficient of (filtered) class graphs scales 
almost linearly with the clustering coefficient measured in the 
(filtered) class projections (see fig. 8A): 

$$C(G) = 0.92\,C(B_v) \pm O(1) \tag{16}$$

Edge removal can leave disconnected nodes in the class 
projection. The average path length $d$ cannot be computed in 
disconnected networks because $d_{ij} = \infty$ for nodes in 
different components. Fortunately, we can use the global 
efficiency $E_{glob}$, a measure that is formally equivalent to 
average path length. Global efficiency of an undirected graph 
$G$ is defined as follows [4]: 

$$E_{glob}(G) = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}} \tag{17}$$

where $0 \le E_{glob}(G) \le 1$. Note that the maximum value 
of global efficiency is attained when $G$ is the complete graph 
having $N(N-1)/2$ possible edges and the minimum value 
indicates that $G$ has no edges, i.e., the graph is completely 
disconnected. We have found that global efficiency in class 
graphs scales with global efficiency of class projections: 

$$E_{glob}(G) = 1.15\,E_{glob}(B_v) \pm O(1) \tag{18}$$
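
A minimal sketch of a global efficiency computation is given 
below. It assumes the standard form, the mean of inverse 
shortest-path distances over all ordered node pairs, with 
unreachable pairs contributing zero, and it uses breadth-first 
search on an unweighted graph stored as a dictionary of 
neighbour sets. 

```python
from collections import deque


def global_efficiency(adj):
    """Average of 1/d(i, j) over ordered pairs of distinct nodes."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        # unreachable nodes never enter dist, so they contribute 0
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))


complete = {i: {j for j in range(4) if j != i} for i in range(4)}
empty = {i: set() for i in range(4)}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

print(global_efficiency(complete), global_efficiency(empty),
      global_efficiency(path))
```

The complete and empty graphs give the two extreme values 
mentioned in the text, which is a quick sanity check for any 
implementation. 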

Good agreement of local and global measurements in the 
class graph $G$ and the class projection supports the view that 
the SW behavior of class graphs is an invariant feature of any 
OO software system. OO programming requires that related 
code and data cluster together in the same class, thus 
resulting in high clustering coefficients. However, methods 
cross class boundaries when they use data (and methods) 
from other external classes. These eventual interactions among 
unrelated software entities yield small average path lengths. 

B. Case Study: Stellarium 

We illustrate a detailed comparison between class graph and 
class projection (see above) with the OS software Stellarium 
(http://stellarium.free.fr). This comparison suggests how useful 
it is to analyze and compare several network representations 
during reverse engineering and program comprehension. 
Stellarium is written in C++ and computes the position of stars 
and other space bodies in real time. Figure 9 shows the largest 
connected components of the class graph $G$ and the class 
projection $B_v$ recovered from the C++ source code with our 
reconstruction algorithms. Looking at the class graph (see 
fig. 9A) we notice two well-defined communities or modules in 
Stellarium. Comparison between the class graph and the class 
projection indicates that modularity is preserved across 
multiple levels of the software hierarchy (see fig. 9B). Indeed, 
every software system analyzed here follows this pattern: global 
features are preserved across levels while individual nodes 
might play different roles depending on the network 
representation. Table 




Fig. 9. Comparison between the (A) class graph $G$ and the (B) class 
projection $B_v$ for the OS software Stellarium. These graphs have the same 
size $N = 101$ and number of edges $L = 162$. Here we display the 
largest connected component of each graph, thus explaining the apparent size 
difference. These networks have different adjacency matrices but identification 
of nodes (displayed with solid black circles) in $G$ and in $B_v$ suggests they 
have the same modular structure. Notice the class s_font is at the boundary 
separating the two main modules. 



III and Table IV summarize individual node measurements 
(classes are highlighted with solid circles in figure 9). 

For instance, class Projector belongs to the same module 
in $G$ and in $B_v$. However, Projector is a hub in 
$B_v$ but has few connections in $G$ (see fig. 9). An opposite 
example is the class stel_core, which is a hub in $G$ but 
has only two connections in $B_v$ (in fact, this node belongs to 
a disconnected subgraph not shown in the above figure). Class 
stel_core relies on many other classes ($k_{out} = 22$) and is 
the main application dispatcher in Stellarium. stel_core is 
the starting point of many code reviews and thus is frequently 
visited by Stellarium engineers. Consequently, this class 
displays the highest betweenness centrality value $BC = 0.03$, 
measured as the total number of shortest paths passing through 
a node [4]. The second largest centrality value $BC = 0.012$ is 
displayed by class component, representing the base class 
of any user interface control in Stellarium. On the other hand, 
class Projector is an instance of the state design pattern 
[34] and keeps the graphical application state (i.e., projection 
matrix, observer coordinates, etc.). Projector is referenced 
by many methods, explaining why it has many connections 
($k = 21$) in $B_v$. 

TABLE IV 
Node Measurements in $B_v$ 
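
For readers who wish to reproduce this kind of node measurement, 
the sketch below computes betweenness centrality with Brandes' 
algorithm on an unweighted, undirected graph. The normalisation 
by (n - 1)(n - 2) ordered pairs is an assumption; the paper does 
not state which normalisation produced the reported values. The 
toy graph and its node names are hypothetical. 

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for unweighted graphs, normalised to [0, 1]."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


# Toy star graph: a central dispatcher class linked to three others.
star = {"core": {"a", "b", "c"}, "a": {"core"}, "b": {"core"},
        "c": {"core"}}
bc = betweenness(star)
print(bc["core"], bc["a"])
```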

IV. Scale-free and Evolution Constraints 

Scale-free networks can be obtained with simple generative 
models of network evolution. These models have two main 
components: network growth and preferential attachment [35]. 
This suggests that the scale-free nature of OO software systems 
stems from the evolutionary process. However, the preferential 
attachment model fails to reproduce the structure of class 
graphs. For example, the predicted exponent $\gamma = 3$ for the 
model is different from the observed exponent 
$\gamma \approx 2.5$. In addition, the clustering coefficient 
for preferential attachment is very low and thus cannot explain 
why class graphs are highly clustered. 

Following an approach similar to other empirical studies 
of software evolution [36], we have collected longitudinal 
data from the evolution of ProRally 2002, a large, proprietary 
computer game from Ubi Soft that was developed by 20 
software engineers (see fig. 2). We have recovered 176 
intermediate class graphs comprising two years of development. 
From this dataset, we have analyzed time series of the number 
of nodes $N(t)$, number of links $L(t)$ and average path length 
$d(t)$. The growth pattern followed by ProRally 2002 was 
approximately linear in $N$ and $L$ (consistently with the 
general observation made in section IV. C) and the final class 
graph has $N = 1993$ nodes and $L = 4987$ links (see fig. 2). 
Tables I and II report some network measurements for the final 
ProRally 2002 class graph. The time evolution of average path 
length $d$ in ProRally 2002 quickly saturates at 
$d \approx 5$ after a brief transient (see fig. 10). This 
constant growth pattern in the time evolution of ProRally 
2002 yields a heterogeneous $P(k)$. As shown by Puniyani 
and Lukose, growing random networks under the constraint of 
constant diameter must display scale-free architecture with 
a scaling exponent $\gamma \in [2, 3]$ [37]. Specifically, they 
found that: 

P{k) « k^~f (19) 

where $\alpha < 1$ is an exponent relating network size $N$ with 
degree fluctuations: 

$$N^{\alpha} = \frac{1}{\langle k \rangle} \int k^2 P(k)\,dk \tag{20}$$

and $\beta$ is an exponent linking the degree cutoff $k_c$ (see 
subsection II. D) with network size $N$, i.e., 

$$k_c \approx N^{\beta} \tag{21}$$

Using our dataset, we estimate $\beta = 0.62 \pm 0.09$ and 
$\alpha = 0.42 \pm 0.08$. This predicts the scaling exponent 
$\gamma \approx 2.59$, to be compared with the average exponent 
over all systems $\langle \gamma \rangle = 2.57 \pm 0.07$ 
(computed from table II). The scaling law in the 
cutoff $k_c(N)$ allows us to provide an analytic calculation of 
the scaling between $L$ and $N$. The following integral gives 
the general relationship between $L$ and $N$: 

$$L \sim N \int k P(k)\,dk \tag{22}$$


Here, we have, 

hue 

in very good agreement with the exponent obtained in section 
II. E for class graphs. Keeping the average path length constant 
during class graph evolution yields a heterogeneous degree 
distribution with the observed exponent. However, we were 
unable to find any intrinsic explanation for this constraint. 
A possible explanation is an exogenous pressure related to 
communication constraints in distributed software teams [38] 
(see below). 
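
Scaling exponents such as the one linking the degree cutoff to 
network size can be estimated from a time series of class graphs 
by a least-squares fit in log-log space. The paper does not 
describe its estimation procedure, so the sketch below is only 
one plausible approach, run here on synthetic data generated 
with a known exponent rather than on the ProRally 2002 dataset. 

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


# Synthetic series: cutoffs follow N^0.62 exactly (no noise).
sizes = [100, 200, 400, 800, 1600]
cutoffs = [n ** 0.62 for n in sizes]
beta = loglog_slope(sizes, cutoffs)
print(round(beta, 2))
```

With real measurements the fit would be noisy, and an error 
estimate on the slope would be needed before comparing exponents. 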
Fig. 10. Evolution of average path length during ProRally 2002 development. 

V. Conclusion 

The structure and operation of software systems can be 
studied at different levels of organization, from the small and 
simple (instructions) to the large and complex (the modular 
architecture). The signature of complex software organization 
is a heterogeneous and hierarchical network. This pattern 
partly explains why it is difficult to find a clear, clean 
decomposition of software systems. Moreover, the role of broad 
distributions in software measurements has been largely 
dismissed. Researchers treat these distributions as if they 
were normally distributed [7]. Such a transformation loses a 
significant amount of information and hides the true nature of 
software. Instead, we must address heterogeneous distributions 
with appropriate tools. This knowledge could be crucial to 
develop future software systems. For example, degree 
distributions predict how many classes have more than, say, a 
hundred data members in a future class diagram of doubled size. 
Notice that using non-power-law expressions for the degree 
distribution inevitably yields inaccurate predictions. In 
conclusion, we must abandon reductionistic descriptions of OO 
systems and replace this view by large-scale statistical 
characterizations preserving the structural variability. 
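
The kind of prediction mentioned above can be sketched with a 
continuous power-law approximation. The function below assumes 
$P(k) \propto k^{-\gamma}$ for $k \ge k_{min}$ with no cutoff, 
so that the fraction of nodes with degree at least $k$ is 
$(k / k_{min})^{1 - \gamma}$; the parameter values are 
illustrative and are not fitted to any system in the paper. 

```python
def expected_count_above(n_classes, k, gamma, k_min=1.0):
    """Expected number of nodes with degree >= k under a continuous
    power law P(k) ~ k^(-gamma), k >= k_min, gamma > 1 (no cutoff)."""
    if k <= k_min:
        return float(n_classes)
    return n_classes * (k / k_min) ** (1.0 - gamma)


# Illustrative values: gamma = 2.5, threshold of 100 links.
now = expected_count_above(2000, 100, 2.5)
doubled = expected_count_above(4000, 100, 2.5)
print(now, doubled)
```

Under these assumptions the expected count grows in proportion 
to system size; a size-dependent cutoff would modify the estimate 
for thresholds close to the largest degrees. 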

A more general question concerns the usefulness of 
single-valued metrics. For example, the distribution of class 
sizes (measured as number of lines of code, NLOC) encodes more 
information than the integral value or system size. Given a 
NLOC value (say, 10 KLOC), there are many distributions 
satisfying this integral value. That is, NLOC is an ambiguous 
measure that provides less information than the original size 
distribution. We can compute average path length and the 
average clustering coefficient from probability distributions of 
basic graph metrics, i.e., connectivity. There is an important 
source of information in the probability distributions of 
software measurements. 
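
A small numerical illustration of this ambiguity, using two 
made-up class-size lists that share the same total of 10 KLOC: 

```python
import statistics

# Two hypothetical systems with the same total size.
balanced = [100] * 100           # 100 classes of 100 lines each
skewed = [5000] + [50] * 100     # one very large class plus 100 small ones

for name, sizes in (("balanced", balanced), ("skewed", skewed)):
    print(name, sum(sizes), max(sizes), statistics.median(sizes))
```

The totals coincide, while the largest class and the median class 
size differ widely, which is the information lost by the single 
integral value. 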

On the other hand, large-scale software development is a 
social task. Interaction between software engineers might have 
an influence on the organization and structuring of source code 
bases. For example, open-source developments are 
geographically distributed. These software teams face pressing 
communication and coordination problems that require specific 
software structures according to the social organization: 
"organizations which design systems are constrained to produce 
designs which are copies of the communications structure of 
these organizations" (Conway's Law) [38]. Under this social 
perspective, software is viewed simultaneously as the product 
and the vehicle that enables efficient communication between 
software engineers, i.e., the communication medium. Separated 
software modules minimize communication overheads, which 
are the bottleneck in distributed software developments [39]. In 
this context, it would be interesting to study how the patterns 
described here relate to distributed software development. 

Finally, while our study focused on static analyses of source 
code, recent studies on object graphs have revealed similar 
patterns in the distribution of run-time connections between 
objects in several programs [40]. This similarity suggests a 
link between structural and dynamical features of OO software 
that should be investigated in future studies. 

Appendix I 
Class graph reconstruction algorithm 

The following algorithm reconstructs the class graph from a 
collection of Java/C++ header files (comments are highlighted 
in italics). The class Digraph implements a directed graph. 
The method Digraph::AddLink(c1, c2) tests if class 
names c1 and c2 have been already inserted in the graph. 
If not, they are inserted correspondingly. We discard repeated 
links (c1, c2). There is no distinction between public, private 
or protected attributes. Finally, the algorithm outputs the 
directed (and undirected) network versions to a file. 

Digraph D;        // class graph D = (C, L)
String c1, c2;    // class names
FOR every header file DO
    WHILE (not end of file) DO
        // Find class declaration
        Look for 'class' keyword;
        c1 = get_class_name();              // c1 ∈ C
        // Test if inheritance relationship ("is a")
        IF (next sequence is ': public') THEN
            c2 = get_parent_class();        // c2 ∈ C
            D.AddLink(c1, c2);              // (c1, c2) ∈ L
        ENDIF
        // Get attributes ("has a")
        WHILE (not end of class) DO
            Look for attribute declaration;
            c2 = get_attribute_class();     // c2 ∈ C
            D.AddLink(c1, c2);              // (c1, c2) ∈ L
        END
    END
END
D.Output;
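
The pseudocode above can be approximated in a few lines of 
Python. This is only a sketch under strong assumptions: it uses 
regular expressions instead of a real C++ parser, handles single 
public inheritance and one attribute declaration per line, and 
skips a small hard-coded set of built-in types. The helper names 
and the sample header are hypothetical. 

```python
import re

# class Name [: public Parent] { body };
CLASS_RE = re.compile(
    r"class\s+(\w+)\s*(?::\s*public\s+(\w+))?\s*\{(.*?)\};", re.S)
# "Type name;" or "Type* name;" at the start of a line
ATTR_RE = re.compile(r"^\s*(\w+)\s*[*&]?\s*\w+\s*;", re.M)
BUILTINS = {"int", "float", "double", "char", "bool", "long",
            "short", "void"}


def class_graph(headers):
    """Return directed links (c1, c2) for "is a" and "has a" relations."""
    links = set()  # a set discards repeated links
    for text in headers:
        for name, parent, body in CLASS_RE.findall(text):
            if parent:                       # inheritance ("is a")
                links.add((name, parent))
            for attr_type in ATTR_RE.findall(body):
                if attr_type not in BUILTINS:
                    links.add((name, attr_type))  # attribute ("has a")
    return links


header = """
class Shape { int id; };

class Circle : public Shape {
public:
    Point center;
    double radius;
    void draw(Canvas& c);
};
"""
links = class_graph([header])
print(sorted(links))
```

Method declarations such as draw are ignored here because the 
attribute pattern requires a declaration of the form "Type name;". 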

Appendix II 

Association graph reconstruction algorithm 

The following algorithm recovers the bipartite association 
class-method graph from a collection of C++ header files. 
The class Bipartite implements a bipartite graph. There 
is a method Bipartite::AddLink(u, v) that checks if 
method u and class v have been already inserted in the graph. 
We assume that methods have unique identifiers. Methods 
having the same name can still be differentiated because they 
belong to different classes (and classes cannot have the same 
name). 

Bipartite B;      // association graph B = (V, U, E)
String v1, v2;    // class names
String u;         // method name
FOR every header file DO
    WHILE (not end of file) DO
        // Find class declaration
        Look for 'class' keyword;
        v1 = get_class_name();                  // v1 ∈ V
        WHILE (not end of class) DO
            Look for method declaration;
            u = get_method_name();              // u ∈ U
            B.AddLink(u, v1);                   // {u, v1} ∈ E
            WHILE (not end of method) DO
                v2 = get_parameter_class();     // v2 ∈ V
                B.AddLink(u, v2);               // {u, v2} ∈ E
            END
        END
    END
END
B.Output;
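
A comparable Python sketch of the association graph 
reconstruction is shown below. As before, it relies on regular 
expressions rather than a C++ parser, and it makes several 
simplifying assumptions: method identifiers are formed as 
Class::name (so overloaded methods collapse into one node), 
parameter types are taken as the first non-const word of each 
parameter, and built-in types are skipped. All helper names and 
the sample header are hypothetical. 

```python
import re

CLASS_RE = re.compile(r"class\s+(\w+)[^{;]*\{(.*?)\};", re.S)
# name(params); optionally followed by const
METHOD_RE = re.compile(r"(\w+)\s*\(([^)]*)\)\s*(?:const)?\s*;")
BUILTINS = {"int", "float", "double", "char", "bool", "long",
            "short", "void"}


def association_graph(headers):
    """Return the edges {u, v} of B as (method, class) pairs."""
    edges = set()
    for text in headers:
        for cls, body in CLASS_RE.findall(text):
            for name, params in METHOD_RE.findall(body):
                method = f"{cls}::{name}"   # unique method identifier
                edges.add((method, cls))    # implicit this parameter
                for param in params.split(","):
                    words = [w for w in re.findall(r"\w+", param)
                             if w != "const"]
                    if words and words[0] not in BUILTINS:
                        edges.add((method, words[0]))
    return edges


header = """
class Star {
public:
    void draw(const Projector& p, int mode);
    double distance(Star other);
};
"""
edges = association_graph([header])
print(sorted(edges))
```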